CustomerSegmentation-Bank

Data Analysis & Visualization - Michael Cheng

Project Problem Statement - AllLife Bank Customer Segmentation

Background

Context

AllLife Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their marketing research team that the penetration in the market can be improved. Based on this input, the marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers.

Another insight from the market research was that the customers perceive the support services of the bank poorly. Based on this, the operations team wants to upgrade the service delivery model, to ensure that customers' queries are resolved faster. The head of marketing and the head of delivery, both decide to reach out to the Data Science team for help.

Objective

Identify different segments in the existing customer base, taking into account their spending patterns as well as past interactions with the bank.

Data Description: Data is available on customers of the bank with their credit limit, the total number of credit cards the customer has, and different channels through which the customer has contacted the bank for any queries. These different channels include visiting the bank, online, and through a call center.

  • Sl_no - Customer Serial Number

  • Customer Key - Customer identification

  • Avg_Credit_Limit - Average credit limit (currency is not specified, you can make an assumption around this)

  • Total_Credit_Cards - Total number of credit cards

  • Total_visits_bank - Total bank visits

  • Total_visits_online - Total online visits

  • Total_calls_made - Total calls made

Import Libraries & Load Data

In [1]:
import plotly.io as pio
pio.renderers.default = 'notebook'

from IPython.display import HTML
HTML('''<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>''')
Out[1]:
In [2]:
import pandas as pd

# Importing PCA and t-SNE
from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

# Summary Tools
from summarytools import dfSummary
data2 = pd.read_excel("/mnt/e/mikecbos_E/Downloads/MIT_Elective-AllLife/Credit+Card+Customer+Data.xlsx")

Data Preprocessing

In [3]:
# Copy of data
df2 = data2.copy()

# Overview of data
print(df2.head())
df2.info()
dfSummary(df2)
   Sl_No  Customer Key  Avg_Credit_Limit  Total_Credit_Cards  \
0      1         87073            100000                   2   
1      2         38414             50000                   3   
2      3         17341             50000                   7   
3      4         40496             30000                   5   
4      5         47437            100000                   6   

   Total_visits_bank  Total_visits_online  Total_calls_made  
0                  1                    1                 0  
1                  0                   10                 9  
2                  1                    3                 4  
3                  1                    1                 4  
4                  0                   12                 3  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
Out[3]:
Data Frame Summary: df2
Dimensions: 660 x 7
Duplicates: 0

  1. Sl_No [int64]: Mean (sd): 330.5 (190.7); min < med < max: 1.0 < 330.5 < 660.0; IQR (CV): 329.5 (1.7); 660 distinct values; 0 missing (0.0%)
  2. Customer Key [int64]: Mean (sd): 55141.4 (25627.8); min < med < max: 11265.0 < 53874.5 < 99843.0; IQR (CV): 43377.2 (2.2); 655 distinct values; 0 missing (0.0%)
  3. Avg_Credit_Limit [int64]: Mean (sd): 34574.2 (37625.5); min < med < max: 3000.0 < 18000.0 < 200000.0; IQR (CV): 38000.0 (0.9); 110 distinct values; 0 missing (0.0%)
  4. Total_Credit_Cards [int64]: value counts: 4: 151 (22.9%), 6: 117 (17.7%), 7: 101 (15.3%), 5: 74 (11.2%), 2: 64 (9.7%), 1: 59 (8.9%), 3: 53 (8.0%), 10: 19 (2.9%), 9: 11 (1.7%), 8: 11 (1.7%); 0 missing (0.0%)
  5. Total_visits_bank [int64]: value counts: 2: 158 (23.9%), 1: 112 (17.0%), 0: 100 (15.2%), 3: 100 (15.2%), 5: 98 (14.8%), 4: 92 (13.9%); 0 missing (0.0%)
  6. Total_visits_online [int64]: Mean (sd): 2.6 (2.9); min < med < max: 0.0 < 2.0 < 15.0; IQR (CV): 3.0 (0.9); 16 distinct values; 0 missing (0.0%)
  7. Total_calls_made [int64]: Mean (sd): 3.6 (2.9); min < med < max: 0.0 < 3.0 < 10.0; IQR (CV): 4.0 (1.3); 11 distinct values; 0 missing (0.0%)

Preliminary Observations

  1. Dataset contains 660 rows, 7 columns with no missing values; all values are integers, representing customer data (credit and bank interactions)

  2. The features align naturally to the following categories: CustomerID, CreditProfile, and BankInteraction

    a. CustomerID: Sl_No and Customer Key

    b. Credit Profile: Avg_Credit_Limit and Total_Credit_Cards

    c. BankInteraction: Total_visits_bank, Total_visits_online, and Total_calls_made

  3. Customer Serial Number (Sl_No) has 660 distinct values, whereas Customer Identification (Customer Key) has 655 distinct values; the apparent duplicates need to be reviewed and verified

  4. Statistically:

    a. Avg_Credit_Limit has the highest coefficient of variation (CV), indicating substantial heterogeneity (see the CV sketch after this list)

    b. As a whole, BankInteraction metrics have moderate variability, with CV roughly between 1.0 and 1.5

    • Total_visits_bank has a limited range (0 to 5), implying that customer interaction relies less on the traditional brick-and-mortar approach to banking

    • Total_visits_online has a wide range (0 to 15) with high variability (standard deviation 2.9 against a mean of 2.6) compared to physical visits, confirming customers' reliance on virtual over physical banking interactions; this contrasts with the other BankInteraction metrics and will benefit from deeper exploration

    • Total_calls_made has relatively consistent variance (standard deviation 2.9 with a mean of 3.6) and a long tail extending to the right; this tail is a group of outliers, a subset of customers who make significantly more calls than the majority, and will benefit from deeper exploration

    c. Total_Credit_Cards shows low variance (standard deviation of 2.2), suggesting a stable distribution across the population

    d. Long Tails are evident in Avg_Credit_Limit, Total_visits_online, and Total_calls_made, and will benefit from deeper exploration into their respective outliers

  5. The CreditProfile category may represent a "low hanging fruit" investigation opportunity to discover potential hidden relationships, due to the high variability of Avg_Credit_Limit juxtaposed with the low variability of Total_Credit_Cards
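As a quick cross-check of the variability claims above, here is a minimal sketch (assuming the df2 DataFrame created above) that computes the coefficient of variation (std / mean) for each numeric feature:

# Hedged sketch: coefficient of variation per numeric feature, assuming df2 from the cell above
numeric_cols = ['Avg_Credit_Limit', 'Total_Credit_Cards',
                'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
cv = df2[numeric_cols].std() / df2[numeric_cols].mean()
print(cv.sort_values(ascending=False).round(2))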

Decision Point
  • CustomerID fields are categorical, and can risk introducing noise to the analysis and clustering
  • Create Customer_ID by concatenating Customer Key with Sl_No to distinguish between records (duplicates of Customer Key may arise from historical transactions, shared access within a household, accounts opened for different purposes by the same customer, etc.)
  • Sl_No and Customer Key can then be dropped
  • The new Customer_ID can then be indexed as necessary in subsequent studies

CustomerID

In [4]:
# Inspect duplicates

duplicate_keys = df2[df2['Customer Key'].duplicated(keep=False)]

duplicate_keys.groupby('Customer Key').size().reset_index(name='Frequency')

duplicate_keys.sort_values(by='Customer Key')
Out[4]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
48 49 37252 6000 4 0 2 8
432 433 37252 59000 6 2 1 2
4 5 47437 100000 6 0 12 3
332 333 47437 17000 7 3 1 0
411 412 50706 44000 4 5 0 2
541 542 50706 60000 7 5 2 2
391 392 96929 13000 4 5 0 0
398 399 96929 67000 6 2 2 2
104 105 97935 17000 2 1 2 10
632 633 97935 187000 7 1 7 0
In [5]:
# Create the Customer_ID by concatenating Customer Key and Sl_No
df2['Customer_ID'] = df2['Customer Key'].astype(str) + "_" + df2['Sl_No'].astype(str)

# Review the updated DataFrame
print(df2[['Customer Key', 'Sl_No', 'Customer_ID']].head(20))
    Customer Key  Sl_No Customer_ID
0          87073      1     87073_1
1          38414      2     38414_2
2          17341      3     17341_3
3          40496      4     40496_4
4          47437      5     47437_5
5          58634      6     58634_6
6          48370      7     48370_7
7          37376      8     37376_8
8          82490      9     82490_9
9          44770     10    44770_10
10         52741     11    52741_11
11         52326     12    52326_12
12         92503     13    92503_13
13         25084     14    25084_14
14         68517     15    68517_15
15         55196     16    55196_16
16         62617     17    62617_17
17         96463     18    96463_18
18         39137     19    39137_19
19         14309     20    14309_20
In [6]:
# drop original CustomerID fields

df2 = df2.drop(['Customer Key', 'Sl_No'], axis = 1)
df2
Out[6]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Customer_ID
0 100000 2 1 1 0 87073_1
1 50000 3 0 10 9 38414_2
2 50000 7 1 3 4 17341_3
3 30000 5 1 1 4 40496_4
4 100000 6 0 12 3 47437_5
... ... ... ... ... ... ...
655 99000 10 1 10 0 51108_656
656 84000 10 1 13 2 60732_657
657 145000 8 1 9 1 53834_658
658 172000 10 1 15 0 80655_659
659 167000 9 0 12 2 80150_660

660 rows × 6 columns

In [7]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Avg_Credit_Limit     660 non-null    int64 
 1   Total_Credit_Cards   660 non-null    int64 
 2   Total_visits_bank    660 non-null    int64 
 3   Total_visits_online  660 non-null    int64 
 4   Total_calls_made     660 non-null    int64 
 5   Customer_ID          660 non-null    object
dtypes: int64(5), object(1)
memory usage: 31.1+ KB

CreditProfile

In [8]:
# Preliminary bivariate analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure with two subplots: one for scatter plot and one for box plot
fig, ax = plt.subplots(2, 1, figsize=(10, 12), sharex=True, gridspec_kw={'height_ratios': [1, 3]})

# Scatter Plot
sns.scatterplot(
    data=df2,
    x='Total_Credit_Cards',
    y='Avg_Credit_Limit',
    ax=ax[0],
    alpha=0.7
)
ax[0].set_title('Scatter Plot of Avg_Credit_Limit vs Total_Credit_Cards')
ax[0].set_ylabel('Avg_Credit_Limit')
ax[0].grid(visible=True)

# Box Plot
sns.boxplot(
    data=df2,
    x='Total_Credit_Cards',
    y='Avg_Credit_Limit',
    ax=ax[1]
)
ax[1].set_title('Box Plot of Avg_Credit_Limit Across Total_Credit_Cards')
ax[1].set_xlabel('Total_Credit_Cards')
ax[1].set_ylabel('Avg_Credit_Limit')
ax[1].grid(visible=True)

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: scatter and box plots of Avg_Credit_Limit across Total_Credit_Cards]
Observations
  • Consistent with intuition: The more credit cards a customer has, the higher their credit limit
  • A few outliers exist in the lower credit card groups, though they are less frequent than in the higher groups. These outliers seem meaningful for further analysis, so K-Medoids will be effective at incorporating these data points proportionally (see the outlier-count sketch below)
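As a rough check on where these outliers sit, here is a minimal sketch (assuming df2 from above, and using the common 1.5 x IQR rule purely for illustration) counting Avg_Credit_Limit outliers within each Total_Credit_Cards group:

# Hedged sketch: count Avg_Credit_Limit outliers per Total_Credit_Cards group,
# using the conventional 1.5 * IQR rule (an illustrative convention, not a project standard)
def iqr_outlier_count(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

print(df2.groupby('Total_Credit_Cards')['Avg_Credit_Limit'].apply(iqr_outlier_count))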
In [9]:
# Kernel Density Estimation: Evaluate the high variability of Avg_Credit_Limit vs low variability of Total_Credit_Cards

# Coefficient of Variation
cv_credit_limit = (df2['Avg_Credit_Limit'].std() / df2['Avg_Credit_Limit'].mean()) * 100
cv_credit_cards = (df2['Total_Credit_Cards'].std() / df2['Total_Credit_Cards'].mean()) * 100

print(f"Raw Scores - Coefficient of Variation:")
print(f"CV of Avg_Credit_Limit: {cv_credit_limit:.2f}%")
print(f"CV of Total_Credit_Cards: {cv_credit_cards:.2f}%")


import numpy as np

# Log transformation for Avg_Credit_Limit to adjust scale
df2['Log_Avg_Credit_Limit'] = np.log1p(df2['Avg_Credit_Limit'])  # Use log(1 + x) to handle zero values if present

# Overlayed Density Plot
plt.figure(figsize=(10, 6))
sns.kdeplot(df2['Log_Avg_Credit_Limit'], label='Log(Avg_Credit_Limit)', fill=True, color='blue', alpha=0.7)
sns.kdeplot(df2['Total_Credit_Cards'], label='Total_Credit_Cards', fill=True, color='orange', alpha=0.7)

# Plot Titles and Labels
plt.title('Overlayed Distributions of Credit Limit and Total Credit Cards', fontsize=14)
plt.xlabel('Distribution', fontsize=12)
plt.xticks([])
plt.yticks([])
plt.text(0.5, 0.95, "Note: Each region is independent and proportional to its own scale.",
         fontsize=8,
         color="gray",
         ha="center",
         va="center",
         transform=plt.gca().transAxes
        )
plt.text(0.01, -0.05, "Less <---",transform=plt.gca().transAxes, fontsize=12, ha='left', va='center')
plt.text(0.99, -0.05, '---> More', transform=plt.gca().transAxes, fontsize=12, ha='right', va='center')

plt.ylabel('Concentration of Customers', fontsize=12)
plt.legend(fontsize=10)
plt.grid(visible=True)

# Display the plot
plt.tight_layout()
plt.show()
Raw Scores - Coefficient of Variation:
CV of Avg_Credit_Limit: 108.83%
CV of Total_Credit_Cards: 46.06%
[Figure: overlaid distributions of log(Avg_Credit_Limit) and Total_Credit_Cards]
Observations_KDE
  • Log transformation was needed due to the scaling differences between the 2 factors
  • Customers in the orange region have low credit cards and low credit limit (mostly being outside of the blue region)
    1. These customers may have limited banking engagement and/or fewer financial resources
    2. These customers may also have banking relationships elsewhere
    3. Marketing to these customers may be more elusive, since it may entail a long-term endeavor with few "quick wins"; reaching them is therefore more incidental than intentional in nature
  • Customers in the blue region have high credit limit
    1. Due to the overlap of orange within this blue region, there is ambiguity as to whether they have many or few credit cards (graph is not to scale)
    2. Ambiguity from the overlapping 2 regions will therefore need further bivariate analysis
    3. AUC Comparison (normalized for absolute scale), along with KMeans Customer-Level Analysis and Segment-Specific Insights, can provide a more methodical approach for predictive analysis and intentional marketing to these customers than to those in the orange region
In [10]:
# Further Bivariate Analysis: AUC Comparison

from sklearn.preprocessing import StandardScaler
from scipy.stats import gaussian_kde
from scipy.integrate import quad

# Step 1: Standardize both variables
scaler = StandardScaler()
df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']] = scaler.fit_transform(
    df2[['Avg_Credit_Limit', 'Total_Credit_Cards']]
)

# Step 2: Define KDEs for standardized data
kde_credit_limit = gaussian_kde(df2['Standardized_Credit_Limit'])
kde_credit_cards = gaussian_kde(df2['Standardized_Credit_Cards'])

# Common X range
x_min = min(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].min())
x_max = max(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].max())
x_range = np.linspace(x_min, x_max, 1000)

y_credit_limit = kde_credit_limit(x_range)
y_credit_cards = kde_credit_cards(x_range)

# Step 3: Calculate Overlap
def overlap_area(x):
    return min(kde_credit_limit(x), kde_credit_cards(x))

overlap_auc, _ = quad(overlap_area, x_min, x_max)

# Total AUCs
total_auc_credit_limit = quad(lambda x: kde_credit_limit(x), x_min, x_max)[0]
total_auc_credit_cards = quad(lambda x: kde_credit_cards(x), x_min, x_max)[0]

# Normalize overlap
overlap_ratio_credit_limit = overlap_auc / total_auc_credit_limit
overlap_ratio_credit_cards = overlap_auc / total_auc_credit_cards

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(x_range, y_credit_limit, label='Standardized Avg_Credit_Limit (KDE)', color='blue')
plt.plot(x_range, y_credit_cards, label='Standardized Total_Credit_Cards (KDE)', color='orange')
plt.fill_between(
    x_range,
    np.minimum(y_credit_limit, y_credit_cards),
    color='purple',
    alpha=0.5,
    label='Overlap Region'
)
plt.title('Overlapping AUC Between Standardized Avg_Credit_Limit and Total_Credit_Cards')
plt.xlabel('Standardized Value Range')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()

# Print Results
print(f"Overlap AUC: {overlap_auc:.4f}")
print(f"Total AUC (Standardized Avg_Credit_Limit): {total_auc_credit_limit:.4f}")
print(f"Total AUC (Standardized Total_Credit_Cards): {total_auc_credit_cards:.4f}")
print(f"Overlap as % of Standardized Avg_Credit_Limit AUC: {overlap_ratio_credit_limit:.2%}")
print(f"Overlap as % of Standardized Total_Credit_Cards AUC: {overlap_ratio_credit_cards:.2%}")
[Figure: overlapping AUC between standardized Avg_Credit_Limit and Total_Credit_Cards]
Overlap AUC: 0.6792
Total AUC (Standardized Avg_Credit_Limit): 0.9977
Total AUC (Standardized Total_Credit_Cards): 0.9509
Overlap as % of Standardized Avg_Credit_Limit AUC: 68.08%
Overlap as % of Standardized Total_Credit_Cards AUC: 71.43%
Observations

The overlap that appeared as a sliver in the earlier plot actually accounts for roughly 68% of the area, while the towering blue and orange non-overlapping regions combined account for about 32%. This high absolute overlap suggests a significant proportion of the distributions align: the ranges of credit limits and credit card counts are shared for many customers. The total AUC scores above confirm that the kernel density estimations are well-defined and appropriately scaled.

  • Of the purple overlapping region:

    • 68.08% pertains to the Avg_Credit_Limit AUC
    • 71.43% pertains to the Total_Credit_Cards AUC

    These percentages suggest a meaningful overlap between the two variables.
  • The remaining ~30% of non-overlapping area may represent unique customer groups that can be reviewed further (a quick numerical cross-check follows).
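As a quick numerical cross-check of the overlap figure, here is a minimal sketch that approximates the same overlap area by trapezoidal integration over the KDE grid already computed above (x_range, y_credit_limit, and y_credit_cards are assumed from the previous cell):

# Hedged sketch: approximate the overlap AUC on the precomputed KDE grid
overlap_grid = np.trapz(np.minimum(y_credit_limit, y_credit_cards), x_range)
print(f"Overlap AUC (trapezoidal approximation): {overlap_grid:.4f}")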

In [11]:
# Credit Profile/Upselling: EDA

from sklearn.cluster import KMeans
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning)    # Suppress UserWarnings

# Extract normalized data
normalized_data = df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']]

# Apply KMeans Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['Cluster'] = kmeans.fit_predict(normalized_data)

# Visualize Clusters
plt.figure(figsize=(10, 6))
plt.scatter(
    df2['Standardized_Credit_Limit'],
    df2['Standardized_Credit_Cards'],
    c=df2['Cluster'],
    cmap='viridis',
    alpha=0.6
)
plt.title('Customer Segments Based on Credit Limit and Credit Cards')
plt.xlabel('Standardized Credit Limit')
plt.ylabel('Standardized Total Credit Cards')
plt.colorbar(label='Cluster')
plt.grid()
plt.show()
[Figure: customer segments based on credit limit and credit cards]
EDA_Observations_on_Credit_Profile

The clusters from this preliminary Credit Profile EDA plot help reveal distinct customer segments based on standardized credit limit and credit card counts. Along with the previous EDA on CustomerID, the Credit Profile features are non-volitional factors that indirectly drive costs, revenue, and/or service dissatisfaction. Engaging with these factors is more incidental (i.e., being ready for when the prospect/customer is willing and able to buy). Interpretations will follow after further PCA and Ensemble Clustering analyses.

DecisionPoint_Upsell
  • While the bank is looking to upsell to its existing customers [2], the dataset provides a very limited view on how to directly contribute to any upselling efforts. A potential proxy for upselling can be Total_Credit_Cards if we assume that holding more credit cards will correlate with higher customer value (i.e. higher revenue, loyalty, or engagement)
  • However, even with using Total_Credit_Cards as a proxy, too many cards can lead to diminished returns (i.e. credit risk for customers, high level of service for the bank, etc.)
  • Given the bank's focus on credit cards, however, "Upselling" will pertain to credit card and loan products, as inferences can be drawn from the relationship between Avg_Credit_Limit and Total_Credit_Cards

PCA and Ensemble Clustering

Banking_Interaction_(bank_visits,_online_visits,_and_calls_made)

  • Banking Interaction is not itself a goal in this study: banking interactions can be a cost to the business while simultaneously presenting revenue opportunities (i.e., they involve confounding factors)
  • These variables represent volitional factors that influence costs, revenue, and/or service dissatisfaction
  • The limited dataset presents Total_Credit_Cards and Credit_Limit as proxies for upselling [3]
  • Banking Interaction, therefore, is meaningful as it pertains to upselling opportunities, specifically for Credit Card and Loan Products
PCA and Clustering Model Analysis to evaluate:
  1. Upselling Opportunities

  2. Ideal Customer Profile (ICP)

  3. Service Dissatisfaction Analysis


UpsellingOpportunities

Upselling: PCA
In [12]:
# Preprocess data, use PCA

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

features = ['Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made', 'Avg_Credit_Limit']
scaler = StandardScaler()
normalized_data = scaler.fit_transform(df2[features])

pca = PCA()
pca_data = pca.fit_transform(normalized_data)

# Plot explained variance ratio
import matplotlib.pyplot as plt
plt.plot(range(1, len(features) + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
[Figure: cumulative explained variance vs. number of components]
In [13]:
# Review Principal Components: Access PCA loadings
loadings = pd.DataFrame(
    pca.components_,
    columns=features,  # Original feature names
    index=[f'PC{i+1}' for i in range(len(features))]  # Label components
)
print(loadings)
     Total_Credit_Cards  Total_visits_bank  Total_visits_online  \
PC1            0.597679           0.280492             0.111783   
PC2            0.030171          -0.586587             0.665161   
PC3            0.284983           0.613522             0.304948   
PC4            0.741352          -0.445278            -0.318388   
PC5           -0.105122          -0.050586            -0.592200   

     Total_calls_made  Avg_Credit_Limit  
PC1         -0.559129          0.488859  
PC2          0.223527          0.403240  
PC3          0.670351         -0.003461  
PC4          0.235605         -0.308617  
PC5          0.364047          0.709337  
In [14]:
# Visualize Principal Components impact

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.heatmap(loadings, annot=True, cmap='coolwarm')
plt.title('Upselling PCA Loadings Heatmap')
plt.xticks(fontsize=8, rotation = 45)
plt.show()
[Figure: Upselling PCA Loadings Heatmap]
Finding

Although the PCA cumulative explained variance plot suggests 2 components as an inflection point, the heat map of loadings reveals that dropping to 2 components would result in the loss of important feature contributions, particularly from PCs 3, 4, and 5, which capture nuanced and actionable patterns in the data.

Decision Point

Based on the loadings and their contributions across all 5 principal components (PCs), including all 5 components appears to be meaningful, especially for capturing nuanced behaviors and contrasts in the data. This approach will capture detailed behavioral patterns (e.g., identifying low-credit, digitally active customers in PC5, etc.), and/or diversity within the customer base. With the limited dataset, the inclusion of all 5 components will not overly complicate subsequent clustering analysis.
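To make this trade-off explicit, here is a minimal sketch (assuming the fitted pca object and the loadings DataFrame from the cells above) that lists each component's explained variance alongside its strongest-loading feature:

# Hedged sketch: per-component explained variance and dominant feature loading,
# assuming `pca` and `loadings` from the cells above
for i, ratio in enumerate(pca.explained_variance_ratio_):
    dominant = loadings.iloc[i].abs().idxmax()
    print(f"PC{i+1}: {ratio:.1%} of variance; strongest loading on {dominant}")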

Upselling: Clustering Analysis
In [15]:
# KMeans / GMM / KMedoids
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
import matplotlib.pyplot as plt
import warnings
import pandas as pd

# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning)    # Suppress UserWarnings

# ===== Step 1: Determine Optimal Number of Clusters (Elbow Plot) =====
# Calculate WCSS for different numbers of clusters
wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(pca_data)
    wcss.append(kmeans.inertia_)

# Plot Elbow Curve
plt.figure(figsize=(5, 3))
plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal Clusters')
plt.show()

# Optimal number of clusters (can be adjusted based on the Elbow Plot)
optimal_k = 3

# ===== Step 2: Add PCA Features to DataFrame =====
# Define feature column names
features = [f"PC{i+1}" for i in range(pca_data.shape[1])]  # Assuming PCA was used
# Create a DataFrame from PCA data if it isn't already part of df2
for i, feature in enumerate(features):
    df2[feature] = pca_data[:, i]

# ===== Step 3: Apply Clustering Methods =====

# ---- KMeans Clustering ----
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_clusters = kmeans.fit_predict(pca_data)
df2['KMeans_Cluster'] = kmeans_clusters

# ---- Gaussian Mixture Model (GMM) ----
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
gmm_clusters = gmm.fit_predict(pca_data)
df2['GMM_Cluster'] = gmm_clusters

# ---- K-Medoids Clustering ----
# Define initial medoids (indices based on domain knowledge or random)
initial_medoids = [0, 50, 100]  # Example indices for 3 clusters
kmedoids_instance = kmedoids(
    pca_data, initial_medoids, metric=distance_metric(type_metric.EUCLIDEAN)
)
kmedoids_instance.process()

# Extract K-Medoids cluster assignments
kmedoids_clusters = kmedoids_instance.get_clusters()
df2['KMedoids_Cluster'] = -1
for cluster_id, indices in enumerate(kmedoids_clusters):
    df2.loc[indices, 'KMedoids_Cluster'] = cluster_id

# ===== Step 4: Analyze Clusters =====

# Function to analyze cluster profiles
def analyze_clusters(df, cluster_column, feature_columns):
    cluster_profiles = df.groupby(cluster_column)[feature_columns].mean()
    cluster_profiles.index = cluster_profiles.index + 1  # Make clusters 1-based index
    return cluster_profiles

# Generate cluster profiles for each method
print("KMeans Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMeans_Cluster', features))

print("\nGMM Cluster Profiles (averages):")
print(analyze_clusters(df2, 'GMM_Cluster', features))

print("\nKMedoids Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMedoids_Cluster', features))
[Figure: Elbow Method for Optimal Clusters (WCSS vs. number of clusters)]
KMeans Cluster Profiles (averages):
                     PC1       PC2       PC3       PC4       PC5
KMeans_Cluster                                                  
1               0.647279 -0.880009 -0.024133  0.032796  0.038631
2               2.992098  3.531877  0.118569 -0.107173 -0.123788
3              -1.783279  0.728079  0.015120 -0.032593 -0.038939

GMM Cluster Profiles (averages):
                  PC1       PC2       PC3       PC4       PC5
GMM_Cluster                                                  
1            0.647279 -0.880009 -0.024133  0.032796  0.038631
2            2.992098  3.531877  0.118569 -0.107173 -0.123788
3           -1.783279  0.728079  0.015120 -0.032593 -0.038939

KMedoids Cluster Profiles (averages):
                       PC1       PC2       PC3       PC4       PC5
KMedoids_Cluster                                                  
1                 0.640276 -0.875990 -0.029436  0.033631  0.037106
2                 2.992098  3.531877  0.118569 -0.107173 -0.123788
3                -1.792937  0.735541  0.024742 -0.034641 -0.036973
Observations
  • The Elbow Plot indicates that 3 clusters is optimal
  • KMeans and KMedoids results are very consistent, both being centroid/medoid-based, particularly for Clusters 1 and 3 (a cross-tabulation sketch of their agreement follows)
  • GMM, being more sensitive to the data distribution, can capture different clustering behaviors
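To verify the stated KMeans/K-Medoids consistency, here is a minimal sketch (assuming the cluster columns created in the cell above) cross-tabulating the two assignments:

# Hedged sketch: agreement between KMeans and K-Medoids assignments
print(pd.crosstab(df2['KMeans_Cluster'], df2['KMedoids_Cluster'],
                  rownames=['KMeans'], colnames=['KMedoids']))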
Upselling: Ensemble Analysis
In [16]:
# 3-Model Ensemble

from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# Assume the previous models have populated the following columns in df2:
# 'KMeans_Cluster', 'GMM_Cluster', 'KMedoids_Cluster'

# Extract cluster labels from the three models
kmeans_labels = df2['KMeans_Cluster'].to_numpy()
gmm_labels = df2['GMM_Cluster'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster'].to_numpy()

# Combine the cluster labels into a single array
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])

# Perform majority voting to generate ensemble cluster assignments
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels

# Calculate silhouette scores for all models, including the ensemble
kmeans_silhouette = silhouette_score(pca_data, kmeans_labels)
gmm_silhouette = silhouette_score(pca_data, gmm_labels)
kmedoids_silhouette = silhouette_score(pca_data, kmedoids_labels)
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)

# Print silhouette scores for comparison
print("Silhouette Scores for Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"GMM Silhouette Score: {gmm_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"3-Model Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Silhouette Scores for Clustering Models:
KMeans Silhouette Score: 0.5157
GMM Silhouette Score: 0.5157
KMedoids Silhouette Score: 0.5158
3-Model Ensemble Silhouette Score: 0.5157
Findings

Based on the silhouette scores above, GMM's distinctive clustering offers no advantage here; KMeans and KMedoids are the (marginally) best-performing models. An ensemble combining these two models is evaluated next.

In [17]:
# KMeans + KMedoids 2-model Ensemble

# Combine KMeans and KMedoids labels
refined_labels = np.array([kmeans_labels, kmedoids_labels])
ensemble_labels = mode(refined_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels

# Calculate silhouette score for the refined ensemble
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)

# Print silhouette scores for comparison
print("Silhouette Scores for Revised Ensemble Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"Refined (2-Model) Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Silhouette Scores for Revised Ensemble Clustering Models:
KMeans Silhouette Score: 0.5157
KMedoids Silhouette Score: 0.5158
Refined (2-Model) Ensemble Silhouette Score: 0.5158
Findings

The revised ensemble of KMeans and KMedoids does not improve on the individual models. KMeans and KMedoids will therefore be used individually, drawing on their complementary strengths.

In [18]:
# Visualize KMeans and KMedoids

import plotly.express as px

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # Adjust perplexity as needed
tsne_results = tsne.fit_transform(pca_data)  # Use PCA-reduced data or normalized original data

# Ensure original index is retained
tsne_df2 = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'], index=df2.index)

# Add cluster labels and original fields
tsne_df2['KMeans_Cluster'] = df2['KMeans_Cluster']
#tsne_df2['GMM_Cluster'] = df2['GMM_Cluster']
tsne_df2['KMedoids_Cluster'] = df2['KMedoids_Cluster']

# Specific fields from df2 for hover information
fields_to_include = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online']
tsne_df2 = tsne_df2.join(df2[fields_to_include])

# Visualize with Plotly for K-Means
fig_kmeans = px.scatter(
    tsne_df2, x='TSNE1', y='TSNE2', color='KMeans_Cluster',
    hover_data=fields_to_include,  # Add fields for hover information
    title='t-SNE Visualization with K-Means Clusters',
    color_continuous_scale='Viridis',
)
fig_kmeans.update_xaxes(showticklabels=False)  # Hide x-axis tick labels
fig_kmeans.update_yaxes(showticklabels=False)  # Hide y-axis tick labels
fig_kmeans.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmeans.show()

# Visualize with Plotly for GMM
#fig_gmm = px.scatter(
    #tsne_df2, x='TSNE1', y='TSNE2', color='GMM_Cluster',
    #hover_data=fields_to_include,
    #title='t-SNE Visualization with GMM Clusters',
    #color_continuous_scale='Viridis'
#)
#fig_gmm.update_xaxes(showticklabels=False)  # Hide x-axis tick labels
#fig_gmm.update_yaxes(showticklabels=False)  # Hide y-axis tick labels
#fig_gmm.show()

# Visualize with Plotly for K-Medoids
fig_kmedoids = px.scatter(
    tsne_df2, x='TSNE1', y='TSNE2', color='KMedoids_Cluster',
    hover_data=fields_to_include,
    title='t-SNE Visualization with K-Medoids Clusters',
    color_continuous_scale='Viridis'
)
fig_kmedoids.update_xaxes(showticklabels=False)  # Hide x-axis tick labels
fig_kmedoids.update_yaxes(showticklabels=False)  # Hide y-axis tick labels
fig_kmedoids.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmedoids.show()
Observation

Both KMeans and KMedoids provided well-separated clustering, as shown in the t-SNE visualizations, indicating that they successfully identified distinct groups in the data. The nuanced differences in cluster boundaries and shapes between the two models may reflect areas of ambiguity in the data. Leveraging both models offers a complementary perspective, capturing distinct aspects of the clustering structure and providing a more holistic view of the data.

Upselling Opportunities: Interpretations
  1. K-Means clusters show some overlap but are mostly separated, indicating largely clear segmentation of customer groups based on behavior (e.g., Total_visits_online, Total_Credit_Cards). This segmentation is suitable for operational simplicity when clear, distinct groups are needed for actionable insights.

  2. K-Medoids clusters are less sensitive to outliers, which is evident in the clean delineation between data points in the t-SNE plots above. This approach balances the robustness of GMM with the clarity of K-Means, especially in handling outliers. Of the three models, it seems preferred for minimizing the influence of noise and extreme values (a cluster-profile sketch in original units follows).
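To ground these interpretations in the original units, here is a minimal sketch (assuming df2 still carries the KMeans_Cluster and KMedoids_Cluster labels assigned above) profiling each cluster on the raw features:

# Hedged sketch: cluster profiles in original units for the two retained models
raw_features = ['Avg_Credit_Limit', 'Total_Credit_Cards',
                'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
print(df2.groupby('KMeans_Cluster')[raw_features].mean().round(1))
print(df2.groupby('KMedoids_Cluster')[raw_features].mean().round(1))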

In [19]:
# Credit Profile EDA interpretations

from sklearn.cluster import KMeans
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning)    # Suppress UserWarnings

print("A Revisit to Credit Profile KMeans Clustering")
print("\n")

# Extract normalized data
normalized_data = df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']]

# Apply KMeans Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['Cluster'] = kmeans.fit_predict(normalized_data)

# Visualize Clusters
plt.figure(figsize=(7, 3))
plt.scatter(
    df2['Standardized_Credit_Limit'],
    df2['Standardized_Credit_Cards'],
    c=df2['Cluster'],
    cmap='viridis',
    alpha=0.6
)
plt.title('Customer Segments Based on Credit Profile', fontsize=10)
plt.xlabel('Standardized Credit Limit', fontsize=7)
plt.ylabel('Standardized Total Credit Cards', fontsize=7)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.tick_params(axis='y', which='both', left=False, labelleft=False)

plt.grid()
plt.show()
A Revisit to Credit Profile KMeans Clustering


[Figure: Customer Segments Based on Credit Profile]
Upsell_Opportunities:_Credit_Profile_implications

Identifying "Upselling Opportunities" from within this limited dataset can be most evident by considering credit cards and loan products. Insights can be derived from the previous plot shown above:

  1. Teal Cluster at the Top Right (High Credit Limit & High Credit Cards)
  • Customers in this cluster have both high standardized credit limits and high standardized credit card counts. This group likely represents premium customers with significant purchasing power and financial engagement, making it a very promising segment. These customers may be ideal candidates for premium products such as high-reward credit cards, investment services, and other exclusive benefits (e.g., concierge services, travel perks). They are likely already engaged with multiple financial products, so marketing should focus on cross-selling or retention strategies.
  2. Purple Cluster at the Bottom Left (Low Credit Limit & Low Credit Cards)
  • Customers in this group have both low credit limits and low card counts. They are likely low-value customers with limited financial engagement, and probably the least promising segment. These customers might not have the capacity to adopt additional financial products; marketing efforts could focus on financial education or low-risk credit-building products. Success in capturing this segment seems to lend itself more to an incidental approach than an intentional one, unlike the top-right quadrant. Their potential for significant growth is limited, so they may not be worth heavy marketing investment.
  3. Yellow Cluster at the Middle Left (Moderate Credit Cards & Low to Moderate Credit Limit)
  • These customers have moderate card counts but relatively low credit limits, and may already be utilizing their limits heavily (possibly maxed out). This is a moderately promising segment. They might be good candidates for credit limit increases (if creditworthiness supports it) and for budgeting tools or financial management products. However, they may represent a credit risk if their current limits are already over-utilized.

IdealCustomerProfile_(ICP)

ICP: PCA
In [20]:
# Preprocess and PCA
# Standardize features

from sklearn.preprocessing import StandardScaler

features = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
scaler = StandardScaler()
df2_icp_scaled = scaler.fit_transform(df2[features])

# Apply PCA

from sklearn.decomposition import PCA
import pandas as pd

pca_ICP = PCA()
pca_transformed_ICP = pca_ICP.fit_transform(df2_icp_scaled)

# Add PCA components back to a new DataFrame
df2_icp_pca = pd.DataFrame(
    pca_transformed_ICP, 
    columns=[f'PCA_ICP_{i+1}' for i in range(pca_ICP.n_components_)]
)
In [21]:
# Clustering variables
from sklearn.cluster import KMeans
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning)    # Suppress UserWarnings

# Clustering on the first few PCA components
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster'] = kmeans_ICP.fit_predict(df2_icp_pca.iloc[:, :3])
In [22]:
# PCA Loadings Table and Heatmap

# Determine relationships between original features and PCA components
loadings_ICP = pd.DataFrame(
    pca_ICP.components_,
    columns=['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
    index=[f'PCA_{i+1}' for i in range(pca_ICP.n_components_)]
)

print("\nContribution Scores from Principal Components \n")
print(loadings_ICP)

# Visualize relationship

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.heatmap(loadings_ICP, annot=True, cmap='coolwarm')
plt.title('PCA Loadings Heatmap (ICP Study)')
plt.xticks(fontsize=8, rotation = 45)
plt.yticks(fontsize=8)
plt.show()
Contribution Scores from Principal Components 

       Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
PCA_1          0.488859            0.597679           0.280492   
PCA_2          0.403240            0.030171          -0.586587   
PCA_3         -0.003461            0.284983           0.613522   
PCA_4         -0.308617            0.741352          -0.445278   
PCA_5          0.709337           -0.105122          -0.050586   

       Total_visits_online  Total_calls_made  
PCA_1             0.111783         -0.559129  
PCA_2             0.665161          0.223527  
PCA_3             0.304948          0.670351  
PCA_4            -0.318388          0.235605  
PCA_5            -0.592200          0.364047  
[Figure: PCA Loadings Heatmap (ICP Study)]
Baseline

The inclusion of all principal components will serve as the baseline against which reduced sets of components are compared.

In [23]:
# Plot cumulative explained variance ratio

# Use the number of features for x-axis limit and cumulative sum of explained variance ratio
plt.figure(figsize=(8, 5))
plt.plot(
    range(1, len(pca_ICP.explained_variance_ratio_) + 1), 
    pca_ICP.explained_variance_ratio_.cumsum(), 
    marker='o', #linestyle='--'
)
plt.xlabel('Number of Principal Components', fontsize=12)
plt.ylabel('Cumulative Explained Variance', fontsize=12)
plt.title('Explained Variance Ratio by Principal Components (ICP)', fontsize=14)
plt.grid(True)
plt.show()
[Figure: Explained Variance Ratio by Principal Components (ICP)]
Observation

A clear inflection point occurs at 2 principal components. Further analysis will compare this reduced set against the baseline.

In [24]:
# Evaluate ICP context for components, all vs reduced

# Filter numeric columns (excluding non-numeric columns like 'Customer_ID')
ICP_numeric_cols = [
    'Avg_Credit_Limit',
    'Total_Credit_Cards',
    'Total_visits_bank',
    'Total_visits_online',
    'Total_calls_made'
]

# Ensure 'ICP_Cluster_All' is included in numeric columns for grouping
if 'ICP_Cluster_All' not in ICP_numeric_cols:
    ICP_numeric_cols.append('ICP_Cluster_All')

# Perform clustering with all 5 components
from sklearn.cluster import KMeans

kmeans_all = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster_All'] = kmeans_all.fit_predict(pca_transformed_ICP[:, :5])

# Cluster summary for all components
summary_all = df2[ICP_numeric_cols].groupby('ICP_Cluster_All').mean()

# Make the cluster labels 1-indexed
summary_all.index = summary_all.index + 1
summary_all.index.name = 'Cluster (Averages)'


print("Cluster Summary (All Components):")
print(summary_all)

# Summary of Reduced Components
from sklearn.cluster import KMeans
import pandas as pd

# Perform clustering with the first 2 components
kmeans_reduced = KMeans(n_clusters=3, random_state=42)
df2['ICP_Cluster_Reduced'] = kmeans_reduced.fit_predict(pca_transformed_ICP[:, :2])

# Cluster summary based on reduced components
# Use PCA-transformed data ONLY for clustering and grouping
summary_reduced = (
    df2[['ICP_Cluster_Reduced']]
    .join(df2[[
        'Avg_Credit_Limit', 'Total_Credit_Cards', 
        'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'
    ]])
    .groupby('ICP_Cluster_Reduced')
    .mean()
)

# Make cluster labels 1-indexed for readability
summary_reduced.index = summary_reduced.index + 1
summary_reduced.index.name = 'Cluster (Averages)'

print("\nICP Cluster Summary (Reduced Components):")
print(summary_reduced)
Cluster Summary (All Components):
                    Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
Cluster (Averages)                                                            
1                       33782.383420            5.515544           3.489637   
2                      141040.000000            8.740000           0.600000   
3                       12174.107143            2.410714           0.933036   

                    Total_visits_online  Total_calls_made  
Cluster (Averages)                                         
1                              0.981865          2.000000  
2                             10.900000          1.080000  
3                              3.553571          6.870536  

ICP Cluster Summary (Reduced Components):
                    Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
Cluster (Averages)                                                            
1                       33782.383420            5.515544           3.489637   
2                       12174.107143            2.410714           0.933036   
3                      141040.000000            8.740000           0.600000   

                    Total_visits_online  Total_calls_made  
Cluster (Averages)                                         
1                              0.981865          2.000000  
2                              3.553571          6.870536  
3                             10.900000          1.080000  
Observation

The cluster summaries from the reduced components match those from all components (only the cluster labels differ), so the results continue to indicate that the reduced components are representative of the full set.

In [25]:
# Check how many customers are assigned to the same cluster across methods
comparison = pd.crosstab(df2['ICP_Cluster_All'], df2['ICP_Cluster_Reduced'])
print("\nCluster Assignment Comparison:")
comparison.columns = comparison.columns + 1
comparison.index = comparison.index + 1
print(comparison)
Cluster Assignment Comparison:
ICP_Cluster_Reduced    1    2   3
ICP_Cluster_All                  
1                    386    0   0
2                      0    0  50
3                      0  224   0
Observation

The comparison of cluster assignments between All Components and Reduced Components strongly supports that the reduced components are adequately representative, with no mixing or overlap: every customer in a given all-components cluster maps to exactly one reduced-components cluster. This shows the reduced components capture the same patterns as all components, with consistency, representativeness, and efficiency (an agreement-score sketch follows).
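To quantify this agreement with a single number, here is a minimal sketch using scikit-learn's adjusted Rand index (a metric added purely for illustration, not used elsewhere in this notebook); it assumes the ICP_Cluster_All and ICP_Cluster_Reduced columns created above:

# Hedged sketch: adjusted Rand index between the two assignments (1.0 = identical partitions up to relabeling)
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(df2['ICP_Cluster_All'], df2['ICP_Cluster_Reduced'])
print(f"Adjusted Rand Index (all vs. reduced components): {ari:.3f}")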

In [26]:
import matplotlib.pyplot as plt

# Create a DataFrame for PCA-transformed data
pca_df = pd.DataFrame(
    pca_transformed_ICP[:, :2],  # Use the first two components
    columns=['PCA_1', 'PCA_2']
)
pca_df['Cluster'] = df2['ICP_Cluster_Reduced']  # Add cluster labels

# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(
    pca_df['PCA_1'], 
    pca_df['PCA_2'], 
    c=pca_df['Cluster'], 
    cmap='viridis', 
    s=50, 
    alpha=0.7
)
plt.title('ICP Customer Clusters (PCA Reduced)', fontsize=14)
plt.xlabel('PCA_1', fontsize=12)
plt.ylabel('PCA_2', fontsize=12)
#plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
[Figure: ICP Customer Clusters (PCA Reduced)]
Observations_PCA_reduced

Whereas the previous Upselling study required 5 components to uncover upselling opportunities, exhaustively reviewing both what to avoid (to prevent driving customers away) and what to pursue (to drive revenue), the ICP study focuses on narrowly defining the ideal customer profile, allowing a more targeted approach. The Upselling study was divergent in nature, exploring a wide range of possibilities; the ICP study is convergent, honing in on the specific traits that define the ideal customer.

Findings

Two principal components are sufficient to create the Ideal Customer Profile, and the PCA scatter plot clearly supports the use of 3 clusters (a quick silhouette cross-check follows).
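As a quick cross-check of the 3-cluster choice, here is a minimal sketch (assuming pca_transformed_ICP from above) comparing silhouette scores for several candidate cluster counts on the first two components:

# Hedged sketch: silhouette score across candidate k on the first two ICP components
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(pca_transformed_ICP[:, :2])
    print(f"k={k}: silhouette = {silhouette_score(pca_transformed_ICP[:, :2], labels):.4f}")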

ICP: Ensemble Clustering Analysis
In [27]:
# All components, with Ensemble

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# Apply KMeans Clustering for ICP
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans_ICP.fit_predict(pca_transformed_ICP)

# Calculate silhouette score for KMeans
silhouette_kmeans_icp_all = silhouette_score(pca_transformed_ICP, df2['KMeans_Cluster_ICP'])

# GMM Clustering
gmm_ICP = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm_ICP.fit_predict(pca_transformed_ICP)

# Calculate silhouette score for GMM
silhouette_gmm_icp_all = silhouette_score(pca_transformed_ICP, df2['GMM_Cluster_ICP'])

# K-Medoids Clustering
initial_medoids = [0, 50, 100]  # Example indices; modify based on your data
kmedoids_instance_ICP = kmedoids(
    pca_transformed_ICP, initial_medoids, metric=distance_metric(type_metric.EUCLIDEAN)
)
kmedoids_instance_ICP.process()
kmedoids_clusters_ICP = kmedoids_instance_ICP.get_clusters()

# Assign K-Medoids clusters to df2
df2['KMedoids_Cluster_ICP'] = -1
for cluster_id, indices in enumerate(kmedoids_clusters_ICP):
    df2.loc[indices, 'KMedoids_Cluster_ICP'] = cluster_id

# Calculate silhouette score for K-Medoids
silhouette_kmedoids_icp_all = silhouette_score(pca_transformed_ICP, df2['KMedoids_Cluster_ICP'])

# Ensemble Clustering (Majority Voting)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster_ICP'].to_numpy()

# Combine labels from all models
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels

# Calculate silhouette score for Ensemble
silhouette_ensemble_icp_all = silhouette_score(pca_transformed_ICP, ensemble_labels)

# Print silhouette scores
print(f"KMeans Silhouette Score (All Components): {silhouette_kmeans_icp_all:.4f}")
print(f"GMM Silhouette Score (All Components): {silhouette_gmm_icp_all:.4f}")
print(f"K-Medoids Silhouette Score (All Components): {silhouette_kmedoids_icp_all:.4f}")
print(f"Ensemble Silhouette Score (All Components): {silhouette_ensemble_icp_all:.4f}")
KMeans Silhouette Score (All Components): 0.5157
GMM Silhouette Score (All Components): 0.5157
K-Medoids Silhouette Score (All Components): 0.5158
Ensemble Silhouette Score (All Components): 0.5157
In [28]:
# KMeans / GMM / KMedoids with first 2 components

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for KMeans
silhouette_kmeans_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP'])

# GMM clustering
gmm = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for GMM
silhouette_gmm_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP'])

# K-Medoids clustering
kmedoids = KMedoids(n_clusters=3, random_state=42, metric='euclidean')
df2['KMedoids_Cluster_ICP'] = kmedoids.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for K-Medoids
silhouette_kmedoids_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMedoids_Cluster_ICP'])

# Ensemble Clustering (Majority Voting)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster_ICP'].to_numpy()

# Combine labels from all models
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels

# Silhouette score for Ensemble
silhouette_ensemble_icp = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels)

# Print silhouette scores
print(f"KMeans Silhouette Score (ICP): {silhouette_kmeans_icp:.4f}")
print(f"GMM Silhouette Score (ICP): {silhouette_gmm_icp:.4f}")
print(f"K-Medoids Silhouette Score (ICP): {silhouette_kmedoids_icp:.4f}")
print(f"Ensemble Silhouette Score (ICP): {silhouette_ensemble_icp:.4f}")
KMeans Silhouette Score (ICP): 0.6829
GMM Silhouette Score (ICP): 0.6829
K-Medoids Silhouette Score (ICP): 0.5125
Ensemble Silhouette Score (ICP): 0.6829
In [29]:
# First 2 components with KMeans / GMM 2-Model Ensemble

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP'] = kmeans.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for KMeans
silhouette_kmeans_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP'])

# GMM clustering
gmm = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP'] = gmm.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for GMM
silhouette_gmm_icp = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP'])

# Ensemble Clustering (Majority Voting with KMeans and GMM)
kmeans_labels = df2['KMeans_Cluster_ICP'].to_numpy()
gmm_labels = df2['GMM_Cluster_ICP'].to_numpy()

# Combine labels from KMeans and GMM
refined_labels = np.array([kmeans_labels, gmm_labels])
ensemble_labels = mode(refined_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP'] = ensemble_labels

# Silhouette score for Ensemble
silhouette_ensemble_icp = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels)

# Print silhouette scores
print(f"KMeans Silhouette Score (ICP): {silhouette_kmeans_icp:.4f}")
print(f"GMM Silhouette Score (ICP): {silhouette_gmm_icp:.4f}")
print(f"Ensemble Silhouette Score (ICP, KMeans + GMM): {silhouette_ensemble_icp:.4f}")
KMeans Silhouette Score (ICP): 0.6829
GMM Silhouette Score (ICP): 0.6829
Ensemble Silhouette Score (ICP, KMeans + GMM): 0.6829
Findings

The silhouette scores show that neither the 3-model nor the 2-model ensemble performs better than the individual models, and K-Medoids is not well suited to this study. Only KMeans and GMM will therefore be used for creating the Ideal Customer Profile. Next, the full set of components will be compared against the first 2 components.

In [30]:
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# ===== Full Components Clustering =====

# Apply KMeans Clustering for ICP
kmeans_ICP = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP_All'] = kmeans_ICP.fit_predict(pca_transformed_ICP)

# Calculate silhouette score for KMeans
silhouette_kmeans_icp_all = silhouette_score(pca_transformed_ICP, df2['KMeans_Cluster_ICP_All'])

# GMM Clustering
gmm_ICP = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP_All'] = gmm_ICP.fit_predict(pca_transformed_ICP)

# Calculate silhouette score for GMM
silhouette_gmm_icp_all = silhouette_score(pca_transformed_ICP, df2['GMM_Cluster_ICP_All'])

# Ensemble Clustering (Majority Voting for Full Components)
kmeans_labels_all = df2['KMeans_Cluster_ICP_All'].to_numpy()
gmm_labels_all = df2['GMM_Cluster_ICP_All'].to_numpy()

# Combine labels from KMeans and GMM
refined_labels_all = np.array([kmeans_labels_all, gmm_labels_all])
ensemble_labels_all = mode(refined_labels_all, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP_All'] = ensemble_labels_all

# Silhouette score for Ensemble
silhouette_ensemble_icp_all = silhouette_score(pca_transformed_ICP, ensemble_labels_all)

# ===== First Two Components Clustering =====

# KMeans clustering
kmeans_2 = KMeans(n_clusters=3, random_state=42)
df2['KMeans_Cluster_ICP_Reduced'] = kmeans_2.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for KMeans
silhouette_kmeans_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], df2['KMeans_Cluster_ICP_Reduced'])

# GMM clustering
gmm_2 = GaussianMixture(n_components=3, random_state=42)
df2['GMM_Cluster_ICP_Reduced'] = gmm_2.fit_predict(pca_transformed_ICP[:, :2])

# Silhouette score for GMM
silhouette_gmm_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], df2['GMM_Cluster_ICP_Reduced'])

# Ensemble Clustering (Majority Voting for First Two Components)
kmeans_labels_reduced = df2['KMeans_Cluster_ICP_Reduced'].to_numpy()
gmm_labels_reduced = df2['GMM_Cluster_ICP_Reduced'].to_numpy()

# Combine labels from KMeans and GMM
refined_labels_reduced = np.array([kmeans_labels_reduced, gmm_labels_reduced])
ensemble_labels_reduced = mode(refined_labels_reduced, axis=0)[0].flatten()
df2['Ensemble_Cluster_ICP_Reduced'] = ensemble_labels_reduced

# Silhouette score for Ensemble
silhouette_ensemble_icp_reduced = silhouette_score(pca_transformed_ICP[:, :2], ensemble_labels_reduced)

# ===== Compare Silhouette Scores =====

print("Full Components Clustering Silhouette Scores:")
print(f"KMeans Silhouette Score (All Components): {silhouette_kmeans_icp_all:.4f}")
print(f"GMM Silhouette Score (All Components): {silhouette_gmm_icp_all:.4f}")
print(f"Ensemble Silhouette Score (All Components): {silhouette_ensemble_icp_all:.4f}")

print("\nFirst Two Components Clustering Silhouette Scores:")
print(f"KMeans Silhouette Score (Reduced Components): {silhouette_kmeans_icp_reduced:.4f}")
print(f"GMM Silhouette Score (Reduced Components): {silhouette_gmm_icp_reduced:.4f}")
print(f"Ensemble Silhouette Score (Reduced Components): {silhouette_ensemble_icp_reduced:.4f}")
Full Components Clustering Silhouette Scores:
KMeans Silhouette Score (All Components): 0.5157
GMM Silhouette Score (All Components): 0.5157
Ensemble Silhouette Score (All Components): 0.5157

First Two Components Clustering Silhouette Scores:
KMeans Silhouette Score (Reduced Components): 0.6829
GMM Silhouette Score (Reduced Components): 0.6829
Ensemble Silhouette Score (Reduced Components): 0.6829

Findings

The first two components outperform the full set of components:

  • KMeans (Reduced Components): 0.6829 vs. KMeans (All Components): 0.5157
  • GMM (Reduced Components): 0.6829 vs. GMM (All Components): 0.5157

The first two PCA components capture the majority of the variance and provide better-defined clusters, while the additional components in the "Full Components" run likely introduce noise or less relevant features, degrading clustering performance.

The ensemble follows the trend: in both cases the ensemble silhouette score matches the individual models, since KMeans and GMM contribute essentially the same partition, so the vote neither improves nor degrades the result.

Full components are less effective: the drop from 0.6829 to 0.5157 when all components are included indicates that the extra dimensions add noise rather than relevant structure. Hereafter, 2 components will be used for the ICP model.
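
As a quick check on the variance claim above, the share of total variance carried by each ICP component can be read directly from the PCA scores. This is a minimal sketch; it assumes pca_transformed_ICP holds the full set of PCA scores fitted earlier in the notebook.

import numpy as np

# Column variances of the PCA scores are proportional to explained variance
comp_var = pca_transformed_ICP.var(axis=0)
var_ratio = comp_var / comp_var.sum()
print("Variance share by component:", np.round(var_ratio, 4))
print("Variance share of first 2 components:", round(var_ratio[:2].sum(), 4))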

In [31]:
# Create a comparison table for ICP-specific clusters

comparison_table = pd.crosstab(
    df2['KMeans_Cluster_ICP'], 
    df2['GMM_Cluster_ICP'],
    rownames=['KMeans_ICP'],
    colnames=['GMM_ICP']
)

# Make the row index (KMeans_ICP) 1-indexed
comparison_table.index = comparison_table.index + 1
comparison_table.index.name = 'KMeans_ICP (1-Indexed)'

# Make the column index (GMM_ICP) 1-indexed
comparison_table.columns = comparison_table.columns + 1
comparison_table.columns.name = 'GMM_ICP (1-Indexed)'

# Display the updated cross-tab
print("\nComparison Table - 1-Indexed:")
print(comparison_table)
Comparison Table - 1-Indexed:
GMM_ICP (1-Indexed)       1    2   3
KMeans_ICP (1-Indexed)              
1                       386    0   0
2                         0  224   0
3                         0    0  50
In [32]:
import matplotlib.pyplot as plt

# Convert the comparison table to a format suitable for plotting
comparison_table_reset = comparison_table.reset_index()
comparison_table_melted = comparison_table_reset.melt(
    id_vars='KMeans_ICP (1-Indexed)', 
    var_name='GMM_ICP (1-Indexed)', 
    value_name='Count'
)

# Remove rows where Count is zero
comparison_table_melted = comparison_table_melted[comparison_table_melted['Count'] > 0]

# Create the bubble plot
plt.figure(figsize=(9, 5))
bubble_plot = plt.scatter(
    comparison_table_melted['GMM_ICP (1-Indexed)'],
    comparison_table_melted['KMeans_ICP (1-Indexed)'],
    s=comparison_table_melted['Count'] * 10,  # Scale bubble size
    alpha=0.6,
    c='blue',
    edgecolors='black'
)

# Add labels and title
plt.title('Comparison of KMeans and GMM Clusters', fontsize=14)
plt.xlabel('GMM Cluster (1-Indexed)', fontsize=12)
plt.ylabel('KMeans Cluster (1-Indexed)', fontsize=12)
plt.xticks(comparison_table.columns)
plt.yticks(comparison_table.index)
plt.grid(True, linestyle='--', alpha=0.6)

# Add annotations for counts
for _, row in comparison_table_melted.iterrows():
    plt.text(
        row['GMM_ICP (1-Indexed)'], 
        row['KMeans_ICP (1-Indexed)'], 
        str(row['Count']), 
        color='black', 
        ha='center', 
        va='center', 
        fontsize=10
    )

# Show the plot
plt.tight_layout()
plt.show()
[Figure: Comparison of KMeans and GMM Clusters (bubble plot)]
Observation

Grouping the customers within clusters helps with evaluating cluster consistency, as well as identifying stable clusters. There is a strong indication that larger groups of customers are more representative, and when these are common across the models, a robust consensus results. Quantitative scoring will add to these insights.
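
One such quantitative score is the Adjusted Rand Index, which measures agreement between two partitions independently of label numbering. This is a minimal sketch using the ICP cluster columns already present in df2.

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two models produce identical partitions of the customers
ari = adjusted_rand_score(df2['KMeans_Cluster_ICP'], df2['GMM_Cluster_ICP'])
print(f"Adjusted Rand Index (KMeans vs GMM, ICP): {ari:.4f}")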

In [33]:
# Visualize with t-SNE

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # Adjust perplexity if needed
tsne_results = tsne.fit_transform(pca_transformed_ICP[:, :2])  # Use PCA-reduced data (first 2 components)

# Create a t-SNE dataframe for ICP clusters
tsne_df2_icp = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'], index=df2.index)

# Add ICP-specific cluster labels and relevant features
tsne_df2_icp['KMeans_Cluster_ICP'] = df2['KMeans_Cluster_ICP']
tsne_df2_icp['GMM_Cluster_ICP'] = df2['GMM_Cluster_ICP']
tsne_df2_icp['KMedoids_Cluster_ICP'] = df2['KMedoids_Cluster_ICP']

# Select hover fields to keep it simple
fields_to_include = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online']
tsne_df2_icp = tsne_df2_icp.join(df2[fields_to_include])

# KMeans
fig_kmeans_icp = px.scatter(
    tsne_df2_icp, x='TSNE1', y='TSNE2', color='KMeans_Cluster_ICP',
    hover_data=fields_to_include,  # Fields for hover information
    title='t-SNE Visualization with KMeans Clusters (ICP)',
    color_continuous_scale='Viridis'
)
fig_kmeans_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmeans_icp.show()

# GMM
fig_gmm_icp = px.scatter(
    tsne_df2_icp, x='TSNE1', y='TSNE2', color='GMM_Cluster_ICP',
    hover_data=fields_to_include,
    title='t-SNE Visualization with GMM Clusters (ICP)',
    color_continuous_scale='Viridis'
)
fig_gmm_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_gmm_icp.show()

# K-Medoids
fig_kmedoids_icp = px.scatter(
    tsne_df2_icp, x='TSNE1', y='TSNE2', color='KMedoids_Cluster_ICP',
    hover_data=fields_to_include,
    title='t-SNE Visualization with K-Medoids Clusters (ICP)',
    color_continuous_scale='Viridis'
)
fig_kmedoids_icp.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmedoids_icp.show()
Interpretation

Both KMeans and GMM have high silhouette scores, confirmed by their distinct and consistent clusters with clear separations in the t-SNE graphs above. They are effective in capturing the underlying structure of the data, making them very useful in defining the Ideal Customer Profile. The t-SNE also confirms that K-Medoids will not be helpful for creating this profile. Using KMeans and GMM together, on the other hand, provides a robust consensus clustering, with cluster alignment that can be used to leverage their respective strengths: KMeans for its simplicity, and GMM for its flexibility in capturing nuanced patterns, especially in overlapping groups.

In [34]:
# Ideal Customer Profile

from IPython.display import Image, display

# Display the image
display(Image(filename='/mnt/e/mikecbos_E/Downloads/MIT_Elective-AllLife/ICP_Personas.png'))


# Get existing cluster combinations in df2
valid_combinations = df2.groupby(['KMeans_Cluster_ICP', 'GMM_Cluster_ICP']).size().reset_index()
valid_combinations.columns = ['KMeans_Cluster_ICP', 'GMM_Cluster_ICP', 'Count']

# Align cluster indexing
ICP_summaries = {}

for kmeans_cluster, gmm_cluster in valid_combinations[['KMeans_Cluster_ICP', 'GMM_Cluster_ICP']].values:
    # Cluster labels in df2 are 0-indexed; use them directly for filtering
    kmeans_filter = kmeans_cluster
    gmm_filter = gmm_cluster

    # Filter rows in df2 matching this cluster combination
    cluster_data = df2[
        (df2['KMeans_Cluster_ICP'] == kmeans_filter) &
        (df2['GMM_Cluster_ICP'] == gmm_filter)
    ]

    # Summarize pertinent features
    ICP_summaries[f"KMeans {kmeans_cluster + 1}, GMM {gmm_cluster + 1}"] = cluster_data[
        ['Avg_Credit_Limit', 'Total_Credit_Cards', 
         'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
    ].mean()

# Convert ICP summaries to DataFrame
ICP_summaries_df = pd.DataFrame(ICP_summaries).T

# Update the index for readability
ICP_summaries_df.index.name = "Cluster Combination"

# Display the ICP summaries
print("\nIdeal Customer Profiles (ICP Summaries):")
print(ICP_summaries_df)
[Image: ICP_Personas.png]
Ideal Customer Profiles (ICP Summaries):
                     Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
Cluster Combination                                                            
KMeans 1, GMM 1          33782.383420            5.515544           3.489637   
KMeans 2, GMM 2          12174.107143            2.410714           0.933036   
KMeans 3, GMM 3         141040.000000            8.740000           0.600000   

                     Total_visits_online  Total_calls_made  
Cluster Combination                                         
KMeans 1, GMM 1                 0.981865          2.000000  
KMeans 2, GMM 2                 3.553571          6.870536  
KMeans 3, GMM 3                10.900000          1.080000  
Personas

Low-income Cluster: KMeans 2, GMM 2

  • Average Credit Limit: $12,174 – Relatively low compared to other clusters.
  • Total Credit Cards: 2.41 – Indicates a conservative number of credit cards.
  • Bank Visits: 0.93 – Very few visits to the bank, suggesting reliance on other channels.
  • Online Visits: 3.55 – Moderate online activity.
  • Calls Made: 6.87 – High reliance on phone interactions.
  • Profile: This persona likely represents low-to-moderate income customers who prefer phone communication and engage moderately with online banking.

Traditional Communication Preference Cluster: KMeans 1, GMM 1

  • Average Credit Limit: $33,782 – Indicates mid-tier customers with good credit access.
  • Total Credit Cards: 5.52 – A significantly higher number of credit cards.
  • Bank Visits: 3.49 – High frequency of bank visits.
  • Online Visits: 0.98 – Very low online activity.
  • Calls Made: 2.00 – Minimal phone engagement.
  • Profile: This persona likely represents traditional customers who rely on in-person banking and have moderate financial resources. Their low online activity indicates limited digital adoption.

Online Communication Preference Cluster: KMeans 3, GMM 3

  • Average Credit Limit: $141,040 – Very high credit limit.
  • Total Credit Cards: 8.74 – A large number of credit cards.
  • Bank Visits: 0.60 – Rarely visits the bank.
  • Online Visits: 10.90 – Heavy online activity.
  • Calls Made: 1.08 – Minimal phone interactions.
  • Profile: This persona represents affluent, tech-savvy customers who prefer online banking and have substantial financial resources. Their low reliance on in-person or phone communication suggests a preference for self-service digital platforms.
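
To make these personas usable in downstream reporting, the 0-indexed ICP cluster labels can be mapped to persona names. This is a minimal sketch; the persona names and the ICP_Persona column are illustrative, and the mapping follows the ICP summary table above (cluster 0 is the ~$33.8K segment, cluster 1 the ~$12.2K segment, cluster 2 the ~$141K segment).

# Hypothetical persona labels keyed to the 0-indexed KMeans ICP clusters
persona_map = {
    0: 'Traditional / In-Branch',
    1: 'Low-to-Moderate Income / Phone-First',
    2: 'Affluent / Digital-First',
}
df2['ICP_Persona'] = df2['KMeans_Cluster_ICP'].map(persona_map)
print(df2['ICP_Persona'].value_counts())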

Service Dissatisfaction Analysis

Exploratory Data Analysis of 5 Components vs 2 Components

Building on the earlier Upselling study, which utilized all components, and the ICP study, which focused on 2 (reduced) components, these studies together provide a framework for exploring Service Dissatisfaction. Specifically, they help analyze the respective contribution scores of the components and evaluate how these arrangements might apply here.

In [35]:
# Heatmap to compare All vs Reduced (2) Components

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define features to apply PCA
features = ['Avg_Credit_Limit', 'Total_Credit_Cards', 
            'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']

# Standardize features before PCA
scaler = StandardScaler()
df2_scaled = scaler.fit_transform(df2[features])

# Apply PCA
pca = PCA()
pca_transformed_ICP = pca.fit_transform(df2_scaled)


# PCA Loadings Matrix (All Components)
loadings_ICP = pd.DataFrame(
    pca.components_,
    columns=['Avg_Credit_Limit', 'Total_Credit_Cards', 
             'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
    index=[f'PCA_{i+1}' for i in range(pca.components_.shape[0])]
)

# Reduced Components Loadings (First 2 Components)
reduced_loadings = pd.DataFrame(
    pca.components_[:2],  # Use the first 2 components
    columns=['Avg_Credit_Limit', 'Total_Credit_Cards', 
             'Total_visits_bank', 'Total_visits_online', 'Total_calls_made'],
    index=['PCA_1', 'PCA_2']  # Label reduced components
)

# Plot side-by-side heatmaps
fig, axes = plt.subplots(1, 2, figsize=(18, 8), gridspec_kw={'width_ratios': [5, 3]})

# All Components Heatmap
sns.heatmap(loadings_ICP, annot=True, cmap='coolwarm', ax=axes[0])
axes[0].set_title('All Components Heatmap', fontsize=14)
axes[0].set_xticklabels(axes[0].get_xticklabels(), fontsize=10, rotation=45)
axes[0].set_yticklabels(axes[0].get_yticklabels(), fontsize=10)
axes[0].set_xlabel('Features', fontsize=12)
axes[0].set_ylabel('Principal Components (All)', fontsize=12)

# Reduced Components Heatmap (First 2 Components)
sns.heatmap(reduced_loadings, annot=True, cmap='coolwarm', ax=axes[1], cbar=False)
axes[1].set_title('Reduced Components Heatmap', fontsize=14)
axes[1].set_xticklabels(axes[1].get_xticklabels(), fontsize=10, rotation=45)
axes[1].set_yticklabels(axes[1].get_yticklabels(), fontsize=10)
axes[1].set_xlabel('Features', fontsize=12)
axes[1].set_ylabel('Principal Components (Reduced)', fontsize=12)

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: All Components Heatmap vs Reduced Components Heatmap]
Observation: Divergence vs Convergence

The prior ICP study was convergent in its focus on identifying the Ideal Customer. Like the Upselling study, this study is divergent: it explores the multiple ways in which diverse customers experience dissatisfaction. While the Reduced (2) Component arrangement works well for the Ideal Customer Profile and using all components is crucial for Upselling, Service Dissatisfaction requires consideration of both high positive (red) and high negative (blue) loadings, rather than focusing solely on high or average values. Highlighting the extreme contribution scores from each principal component will offer valuable insights in this context.

In [36]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Define thresholds for highlighting extreme values
upper_threshold = 0.5   # loadings above +0.5 are treated as extreme
lower_threshold = -0.5  # loadings below -0.5 are treated as extreme

# Mask all non-extreme values
masked_data = loadings_ICP.copy()  # Assuming `loadings_ICP` is your PCA loadings DataFrame
masked_data[(masked_data < upper_threshold) & (masked_data > lower_threshold)] = np.nan  # Mask non-extreme values

# Plot the heatmap with extreme values highlighted
plt.figure(figsize=(10, 6))
sns.heatmap(masked_data, annot=True, cmap='coolwarm', fmt='.2f', cbar=False, 
            vmax=1, vmin=-1, linewidths=0.5, linecolor='black')

# Set the title
plt.title('', fontsize=14)

# Move X-axis tick labels to the top
plt.xticks(fontsize=10, rotation=45)
plt.yticks(fontsize=10)
plt.gca().xaxis.tick_top()  # Move X-axis ticks to the top
plt.gca().xaxis.set_label_position('top')  # Keep the X-axis label aligned with the top ticks
plt.xlabel('Features', fontsize=12)  # Columns of the loadings matrix are the features
plt.ylabel('Principal Components', fontsize=12)  # Rows are the principal components

# Adjust layout
plt.tight_layout()
plt.show()

# Define features for SVC
features = ['Avg_Credit_Limit', 'Total_Credit_Cards', 
            'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']

# Standardize features for SVC
scaler = StandardScaler()
df2_scaled_SVC = scaler.fit_transform(df2[features])

# Apply PCA for SVC
pca_SVC = PCA()
pca_transformed_SVC = pca_SVC.fit_transform(df2_scaled_SVC)

# Define the PCA Loadings Matrix
loadings_SVC = pd.DataFrame(
    pca_SVC.components_,
    columns=features,
    index=[f'PCA_{i+1}' for i in range(pca_SVC.components_.shape[0])]
)

# Define thresholds for extreme values
upper_threshold = 0.5
lower_threshold = -0.5

# Create reduced subsets
reduced_loadings = {
    "PC1 + PC2": loadings_SVC[:2],
    "PC1 + PC2 + PC3": loadings_SVC[:3],
    "PC1 + PC2 + PC3 + PC4": loadings_SVC[:4]
}

# Apply extreme value mask to each subset
masked_loadings = {
    label: subset.where((subset > upper_threshold) | (subset < lower_threshold), np.nan)
    for label, subset in reduced_loadings.items()
}

# Plot thumbnails in a grid layout
fig, axes = plt.subplots(1, 3, figsize=(20, 6), sharey=False)

for ax, (label, data) in zip(axes, masked_loadings.items()):
    sns.heatmap(data, annot=True, cmap='coolwarm', fmt='.2f', cbar=False, ax=ax)
    ax.set_title(label, fontsize=14)
    ax.set_xlabel('', fontsize=12)
    ax.set_ylabel('', fontsize=12)
    ax.tick_params(axis='x', labelrotation=45, labelsize=10)
    ax.tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()
[Figures: extreme-loadings heatmap; reduced-subset loadings heatmaps (PC1 + PC2, PC1 + PC2 + PC3, PC1 + PC2 + PC3 + PC4)]
Observation

The above heatmap highlights both high and low contribution scores for each principal component. Reviewing the contribution scores of principal components 2 through 4, specifically in relation to each feature, offers an additional perspective.

Decision Point

When reviewing the extreme contribution scores of each principal component to the features, PCA_3 appears redundant due to overlapping feature representation:

  • Total_visits_bank: The contribution of PCA_3 is already well-represented by PCA_2.

  • Total_calls_made: The contribution of PCA_3 is already well-represented by PCA_1.

Explained Variance Contribution as a Metric

The Explained Variance Contribution metric is ideal for this Service Dissatisfaction analysis because it quantifies how much of the total variance in the data is captured by each principal component. This includes both positive and negative contributions, ensuring a comprehensive view of the data's structure. This metric will help evaluate the exclusion of PCA_3.
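
Before plotting, the metric can be checked numerically. This is a minimal sketch; it assumes the pca_SVC object fitted in the previous cell.

# Share of total variance retained when PCA_3 (index 2) is excluded
evr = pca_SVC.explained_variance_ratio_
kept = [0, 1, 3, 4]  # PC1, PC2, PC4, PC5
print(f"Variance retained by PC1+PC2+PC4+PC5: {evr[kept].sum():.4f}")
print(f"Variance carried by PCA_3 alone:      {evr[2]:.4f}")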

In [37]:
# Explained Variance Contribution Comparison

import matplotlib.pyplot as plt

# All Components - Explained Variance Contribution
explained_variance_ratio = pca_SVC.explained_variance_ratio_
cumulative_variance = explained_variance_ratio.cumsum()

# Selected Components - Explained Variance Contribution
selected_indices = [0, 1, 3, 4]  # Indices corresponding to PC1, PC2, PC4, PC5
selected_explained_variance = explained_variance_ratio[selected_indices]
cumulative_selected_variance = selected_explained_variance.cumsum()

# Create a side-by-side comparison plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# Plot All Components
axes[0].bar(
    range(1, len(explained_variance_ratio) + 1), 
    explained_variance_ratio, alpha=0.7, label='Individual Explained Variance'
)
axes[0].step(
    range(1, len(cumulative_variance) + 1), 
    cumulative_variance, where='mid', color='red', label='Cumulative Explained Variance'
)
axes[0].set_title("Explained Variance (All Components)", fontsize=14)
axes[0].set_xlabel("Principal Component", fontsize=12)
axes[0].set_ylabel("Variance Explained", fontsize=12)
axes[0].legend(loc='best')

# Plot Selected Components
axes[1].bar(
    [1, 2, 4, 5], 
    selected_explained_variance, alpha=0.7, label='Individual Explained Variance'
)
axes[1].step(
    [1, 2, 4, 5], 
    cumulative_selected_variance, where='mid', color='red', label='Cumulative Explained Variance'
)
axes[1].set_title("Explained Variance (Reduced Components)", fontsize=14)
axes[1].set_xlabel("Principal Component", fontsize=12)
#axes[1].legend(loc='best')

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: Explained Variance (All Components) vs Explained Variance (Reduced Components)]
Findings
  • PCA_1 and PCA_2 dominate the explained variance, capturing the majority of variability in the data.
  • Efficient Variance Retention: PCA_1, PCA_2, PCA_4, and PCA_5 collectively capture most of the variance, confirming the redundancy of PCA_3, and the above Decision Point to exclude it in this Service Dissatisfaction study.
  • The reduced component set (PCA_1, PCA_2, PCA_4, PCA_5) balances simplicity and variance retention, making it ideal for clustering and interpretation.
In [38]:
# Clustering Analysis: Elbow Plot

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Index 0 corresponds to PC1, so the desired components are 0, 1, 3, 4
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]

# Define range of cluster numbers to evaluate
cluster_range = range(1, 10)  # Try 1 to 9 clusters
inertia_values = []

# Compute inertia for each number of clusters
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(SVC_selected_components)
    inertia_values.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 6))
plt.plot(cluster_range, inertia_values, marker='o', linestyle='-')
plt.xlabel('Number of Clusters', fontsize=12)
plt.ylabel('Inertia (Sum of Squared Distances)', fontsize=12)
plt.title('Elbow Plot for Optimal Clusters', fontsize=14)
plt.xticks(cluster_range, fontsize=10)
plt.yticks(fontsize=10)
plt.grid(True)
plt.show()
[Figure: Elbow Plot for Optimal Clusters]
Finding

Three clusters are optimal for this Service Dissatisfaction analysis; a silhouette cross-check over neighboring values of k follows below.
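
The elbow reading can be cross-checked with silhouette scores for a small range of cluster counts. This is a minimal sketch; it reuses SVC_selected_components from the cell above.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette score for k = 2..6 on the selected components
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42).fit_predict(SVC_selected_components)
    print(f"k={k}: silhouette = {silhouette_score(SVC_selected_components, labels_k):.4f}")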

In [39]:
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode

# Evaluation of silhouette scores for 3 Clusters

# Selected PCA components for SVC
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]

# KMeans clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(SVC_selected_components)

# Add KMeans cluster labels to the DataFrame
df2['SVC_KMeans_Cluster'] = kmeans_labels

# Calculate silhouette score for KMeans
silhouette_kmeans = silhouette_score(SVC_selected_components, kmeans_labels)

# GMM clustering with 3 clusters
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(SVC_selected_components)

# Add GMM cluster labels to the DataFrame
df2['SVC_GMM_Cluster'] = gmm_labels

# Calculate silhouette score for GMM
silhouette_gmm = silhouette_score(SVC_selected_components, gmm_labels)

# K-Medoids clustering with 3 clusters
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids_labels = kmedoids.fit_predict(SVC_selected_components)

# Add K-Medoids cluster labels to the DataFrame
df2['SVC_KMedoids_Cluster'] = kmedoids_labels

# Calculate silhouette score for K-Medoids
silhouette_kmedoids = silhouette_score(SVC_selected_components, kmedoids_labels)

# Ensemble Voting for Final Clusters
# Combine the cluster assignments from all methods
cluster_results = np.array([kmeans_labels, gmm_labels, kmedoids_labels]).T

# Determine the ensemble cluster assignment using majority voting
ensemble_labels = mode(cluster_results, axis=1)[0].flatten()

# Add ensemble cluster labels to the DataFrame
df2['SVC_Ensemble_Cluster'] = ensemble_labels

# Calculate silhouette score for Ensemble
silhouette_ensemble = silhouette_score(SVC_selected_components, ensemble_labels)

# Print silhouette scores
print(f"KMeans Silhouette Score: {silhouette_kmeans:.4f}")
print(f"GMM Silhouette Score: {silhouette_gmm:.4f}")
print(f"K-Medoids Silhouette Score: {silhouette_kmedoids:.4f}")
print(f"Ensemble Silhouette Score: {silhouette_ensemble:.4f}")
KMeans Silhouette Score: 0.5672
GMM Silhouette Score: 0.5672
K-Medoids Silhouette Score: 0.3787
Ensemble Silhouette Score: 0.5672
Decision Point: Ensemble Clustering

While the silhouette scores for KMeans and GMM are equal and the ensemble does not show a quantitative improvement in clustering quality, leveraging an ensemble approach allows us to combine the strengths of all three models:

  • KMeans' simplicity in identifying well-separated, defined clusters.

  • GMM's probabilistic modeling, which captures overlapping or elliptical clusters and accounts for uncertainty in assignments.

  • K-Medoids' robustness to outliers, providing a valuable perspective on extreme cases that might represent significant dissatisfaction or unique customer behaviors.

This combined framework is particularly valuable when analyzing extreme values (high and low probabilities) and outliers. By integrating the robustness of K-Medoids with the strengths of KMeans and GMM, the ensemble clustering approach uncovers nuanced patterns that may not be evident from any model independently. This is essential for identifying and addressing service dissatisfaction and understanding edge cases, enabling better-targeted strategies and improved service quality.
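
One caveat: majority voting assumes the three models refer to clusters with consistent integer IDs, which appears to hold here since the KMeans and GMM partitions coincide. If the labels ever diverge, an explicit alignment step can be applied before voting. This is a minimal sketch; the align_to_reference helper and the greedy contingency-matrix matching are illustrative, not part of the cell above.

import numpy as np
from scipy.stats import mode
from sklearn.metrics import confusion_matrix

def align_to_reference(reference, labels):
    # Relabel `labels` so each cluster ID maps to the reference cluster it overlaps most
    cm = confusion_matrix(reference, labels)
    mapping = cm.argmax(axis=0)
    return np.array([mapping[l] for l in labels])

gmm_aligned = align_to_reference(kmeans_labels, gmm_labels)
kmedoids_aligned = align_to_reference(kmeans_labels, kmedoids_labels)
aligned_votes = np.array([kmeans_labels, gmm_aligned, kmedoids_aligned])
ensemble_aligned = mode(aligned_votes, axis=0)[0].flatten()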

In [40]:
# Cluster Summary Data Table

# Cluster summary for ensemble clusters
SVC_cluster_summary = df2.groupby('SVC_Ensemble_Cluster')[
    ['Avg_Credit_Limit', 'Total_Credit_Cards', 
     'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
].mean()

print("Cluster Summary (Average values):")
# 1-Indexed for Cluster ID
SVC_cluster_summary.index = SVC_cluster_summary.index + 1
print(SVC_cluster_summary)
Cluster Summary (Average values):
                      Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
SVC_Ensemble_Cluster                                                            
1                         33782.383420            5.515544           3.489637   
2                        141040.000000            8.740000           0.600000   
3                         12174.107143            2.410714           0.933036   

                      Total_visits_online  Total_calls_made  
SVC_Ensemble_Cluster                                         
1                                0.981865          2.000000  
2                               10.900000          1.080000  
3                                3.553571          6.870536  
Context for the 3 clusters
  • Cluster 1: Features moderate credit limits (avg. $33,782), mid-level credit cards (avg. 5.5), high bank visits (avg. 3.49), and minimal online visits (avg. 0.98).

  • Cluster 2: Characterized by high credit limits (avg. $141,040), many credit cards (avg. 8.7), low bank visits (avg. 0.6), and frequent online activity (avg. 10.9), with very few calls (avg. 1.08).

  • Cluster 3: Represents low credit limits (avg. $12,174), fewer credit cards (avg. 2.4), moderate online visits (avg. 3.55), and high call activity (avg. 6.87).

In [41]:
from scipy.spatial.distance import jensenshannon

# Evaluate extreme high and extreme low values between clusters distributions

# Calculate feature distributions for each cluster
feature_distributions = df2.groupby('SVC_Ensemble_Cluster')[features].mean()

# Jensen-Shannon divergence between clusters
js_divergences = np.zeros((len(feature_distributions), len(feature_distributions)))

for i in range(len(feature_distributions)):
    for j in range(len(feature_distributions)):
        js_divergences[i, j] = jensenshannon(feature_distributions.iloc[i], feature_distributions.iloc[j])

# Convert divergence matrix to DataFrame for visualization
js_divergences_df = pd.DataFrame(
    js_divergences, 
    index=[f"Cluster {i+1}" for i in feature_distributions.index],
    columns=[f"Cluster {i+1}" for i in feature_distributions.index]
)

print("Jensen-Shannon Divergence Matrix:")
print(js_divergences_df)
print("\n")

# Heatmap visualization
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(js_divergences_df, annot=True, cmap="coolwarm", fmt=".3f")
plt.title("Jensen-Shannon Divergence Between Clusters")
plt.show()
Jensen-Shannon Divergence Matrix:
           Cluster 1  Cluster 2  Cluster 3
Cluster 1   0.000000   0.007553   0.013505
Cluster 2   0.007553   0.000000   0.015791
Cluster 3   0.013505   0.015791   0.000000


[Figure: Jensen-Shannon Divergence Between Clusters (heatmap)]
Observations

The Jensen-Shannon Divergence (JSD) measures divergence between cluster distributions, taking into account the entire spectrum of values (positive and negative). The highest divergence is observed between Clusters 2 and 3, and between Clusters 1 and 3, indicating distinct behavioral or characteristic patterns. The lowest divergence is observed between Clusters 1 and 2, suggesting some overlap or shared characteristics.

High divergence highlights clusters at the extremes of service dissatisfaction or user characteristics, essential for targeting specific behaviors or needs. Low divergence reveals clusters with potentially overlapping behaviors, aiding in refining cluster boundaries or exploring transitional patterns.
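
For intuition on the scale of the metric, a toy example with made-up probability vectors: identical vectors give a JSD of 0, and the value grows as the vectors diverge.

from scipy.spatial.distance import jensenshannon

p = [0.2, 0.3, 0.5]
q = [0.5, 0.3, 0.2]
print(jensenshannon(p, p))  # 0.0 - identical distributions
print(jensenshannon(p, q))  # > 0 - distributions differ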

In [42]:
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd
from sklearn.cluster import KMeans

# Step 1: Apply t-SNE to reduce to 2D space
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
tsne_results = tsne.fit_transform(SVC_selected_components)

# Step 2: Create a DataFrame for visualization
tsne_df = pd.DataFrame(tsne_results, columns=['t-SNE_1', 't-SNE_2'])
tsne_df['SVC_KMeans_Cluster'] = df2['SVC_KMeans_Cluster']
tsne_df['SVC_GMM_Cluster'] = df2['SVC_GMM_Cluster']
tsne_df['SVC_KMedoids_Cluster'] = df2['SVC_KMedoids_Cluster']

# Step 3: Visualize t-SNE for each clustering model using Plotly

# KMeans Clustering
fig_kmeans = px.scatter(
    tsne_df, 
    x='t-SNE_1', 
    y='t-SNE_2', 
    color='SVC_KMeans_Cluster', 
    title='t-SNE Visualization for KMeans Clustering',
    labels={'color': 'SVC_KMeans Cluster'}
)
fig_kmeans.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)  # Set dimensions
fig_kmeans.show()

# GMM Clustering
fig_gmm = px.scatter(
    tsne_df, 
    x='t-SNE_1', 
    y='t-SNE_2', 
    color='SVC_GMM_Cluster', 
    title='t-SNE Visualization for GMM Clustering',
    labels={'color': 'SVC_GMM Cluster'}
)
fig_gmm.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)  # Set dimensions
fig_gmm.show()

# KMedoids Clustering
fig_kmedoids = px.scatter(
    tsne_df, 
    x='t-SNE_1', 
    y='t-SNE_2', 
    color='SVC_KMedoids_Cluster', 
    title='t-SNE Visualization for KMedoids Clustering',
    labels={'color': 'SVC_KMedoids Cluster'}
)
fig_kmedoids.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)  # Set dimensions
fig_kmedoids.show()

# Select specific principal components (PC1, PC2, PC4, PC5)
# Index 0 corresponds to PC1, so the desired components are 0, 1, 3, 4
SVC_selected_components = pca_transformed_SVC[:, [0, 1, 3, 4]]

# Use the selected components for clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(SVC_selected_components)

# Add cluster labels to the original DataFrame
df2['SVC_Cluster_Selected'] = kmeans_labels

# Filter loadings for PC1, PC2, PC4, PC5
SVC_selected_loadings = loadings_SVC.loc[['PCA_1', 'PCA_2', 'PCA_4', 'PCA_5']]
print("Contribution Weighting: Service Dissatisfaction Loadings (unbounded)")
print("\n")
print(SVC_selected_loadings)
Contribution Weighting: Service Dissatisfaction Loadings (unbounded)


       Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
PCA_1          0.488859            0.597679           0.280492   
PCA_2          0.403240            0.030171          -0.586587   
PCA_4         -0.308617            0.741352          -0.445278   
PCA_5          0.709337           -0.105122          -0.050586   

       Total_visits_online  Total_calls_made  
PCA_1             0.111783         -0.559129  
PCA_2             0.665161          0.223527  
PCA_4            -0.318388          0.235605  
PCA_5            -0.592200          0.364047  
Observations

The t-SNE visualizations confirm valid cluster separations, and the corresponding Contribution Weighting table provides insight into the drivers of cluster formation:

  • PCA_1 is heavily influenced by credit-related factors and inversely related to call activity. This suggests any dissatisfaction might be related to account or product management, and less so to customer service calls.
  • PCA_2 is driven by online interactions over in-person banking. This suggests any dissatisfaction might be related to frustrations with online engagement, and less influenced by in-person visits.
  • PCA_4 has a strong positive contribution from Total_Credit_Cards and moderate negative contributions from Total_visits_bank and Total_visits_online. This axis separates customers by card holdings versus branch and online interaction, so dissatisfaction here may be tied to product load rather than channel use.
  • PCA_5 has a strong positive contribution from Avg_Credit_Limit and a moderate negative contribution from Total_visits_online. At its low end, this suggests dissatisfaction may be associated with lower credit limits combined with high online activity.
In [43]:
# t-SNE Ensemble

tsne_df['SVC_Ensemble_Cluster'] = df2['SVC_Ensemble_Cluster']
fig = px.scatter(
    tsne_df,
    x='t-SNE_1',
    y='t-SNE_2',
    color='SVC_Ensemble_Cluster',
    title='t-SNE Visualization for Ensemble Clustering',
    labels={'color': 'SVC Ensemble Cluster'}
)
fig.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig.show()
Observation

The above ensemble clustering demonstrates high fidelity as a collective representation of individual models by consolidating their strengths: KMeans' simplicity, GMM's probabilistic overlaps, and KMedoids' outlier handling. This integration forms a robust foundation for uncovering unique insights into service dissatisfaction, examining edge cases, and analyzing cluster overlaps. By leveraging the ensemble's comprehensive perspective, we can identify nuanced patterns and address specific business challenges effectively. The ensemble together represents a balanced, integrated view, visualized by the clusters-to-components 3D plot below.

In [44]:
# 3D Visualize Components within Clusters

# Create a DataFrame with reduced components for interactive visualization
# Adjust the column names to match the number of components
pca_df = pd.DataFrame(
    SVC_selected_components, 
    columns=['PC1', 'PC2', 'PC4', 'PC5']  # Use only the selected components
)
pca_df['Cluster'] = df2['SVC_Ensemble_Cluster'] + 1 # Add cluster labels


# Define which components to plot
component_x = 'PC5'  # Replace with desired component
component_y = 'PC2'  # Replace with desired component
component_z = 'PC4'  # Replace with desired component
color_component = 'PC1'  # Component for coloring the points

hover_data = {
    'Cluster': True,
    'PC1': True, 
    'PC2': True,
    'PC4': True,
    'PC5': True
}

fig = px.scatter_3d(
    pca_df, 
    x=component_x, 
    y=component_y, 
    z=component_z, 
    color=color_component,  # Color points by the selected component
    title=f"3D Visualization ({component_x} vs {component_y} vs {component_z}, Color: {color_component})",
    labels={'color': f"{color_component}"},  # Label the color legend
    opacity=0.7,
    hover_data=hover_data
)

# Update layout for better visuals
fig.update_layout(
    width=800,
    height=500,
    scene=dict(
        camera=dict(
            up=dict(x=0, y=0, z=1),  # Standard upward orientation
            center=dict(x=0, y=0, z=0),  # Center the plot at the origin
            eye=dict(x=1.5, y=0.5, z=1.25)  # Camera position
        ),
        xaxis_title=component_x,
        yaxis_title=component_y,
        zaxis_title=component_z,
        xaxis=dict(showticklabels=False),
        yaxis=dict(showticklabels=False),
        zaxis=dict(showticklabels=False),
        aspectratio=dict(
            x=1,
            y=1,
            z=0.8)
    ),
    margin=dict(
        l=100,
        r=1,
        t=25,
        b=1
    ),
    coloraxis_colorbar=dict(title=color_component)
)

# Show the plot
fig.show()
Observations

The 3D plot reveals clear and distinct clusters, confirming the separation identified through PCA and clustering. The color gradient representing PC1 (which accounts for the majority of variance, as shown in the Explained Variance Contribution Comparison) highlights areas of intensity, such as dissatisfaction extremes driven by credit limits and channel reliance. This plot consolidates the analysis, illustrating the interplay among all four components and their contributions to cluster formation.

  • Cluster 1 can be characterized as Balanced Usage: Moderate engagement across channels but potential dissatisfaction due to moderate financial access.
  • Cluster 2 can be characterized as Digital-First, High-Credit: High online usage with minimal in-person or call dependency, but dissatisfaction may arise from unmet digital service expectations.
  • Cluster 3 can be characterized as High Call Dependency: Dissatisfaction arises from financial constraints and high reliance on call-based support.
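
To ground the channel-reliance characterizations above, the contact-channel mix per ensemble cluster can be summarized as row shares. This is a minimal sketch reusing the ensemble labels in df2.

# Share of each contact channel per dissatisfaction cluster (rows sum to 1)
channels = ['Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
channel_mix = df2.groupby('SVC_Ensemble_Cluster')[channels].mean()
channel_mix = channel_mix.div(channel_mix.sum(axis=1), axis=0).round(3)
channel_mix.index = channel_mix.index + 1  # 1-indexed to match the discussion above
print(channel_mix)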

Conclusion & Recommendations

Conclusion

This comprehensive study offers a multi-faceted view of the customers and banking interactions at AllLife Bank. The dataset reveals insights into both volitional and non-volitional factors that shape this complex and dynamic environment. Divergent and convergent analyses highlight both incidental and intentional approaches to customer engagement. Successful implementation of the study's findings will require a focus-group approach to validate and refine these insights, combined with an iterative feedback loop to ensure the findings remain as dynamic as the evolving landscape of AllLife Bank.

Recommendations

Use cluster segmentations from this study (K-Means and K-Medoids) in a focus-group approach to validate hypotheses and refine upselling strategies. Steps for hypothesis testing would include: 1) using the identified clusters to define the hypotheses, 2) selecting samples of diverse customers from each cluster (a sampling sketch follows below), and 3) testing and validating by exploring preferences qualitatively in focus groups and with A/B testing to measure response to targeted offers. Feedback from these findings should be used iteratively to refine the segments and adjust profiles and strategies.
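
For step 2, a stratified draw from each cluster keeps the focus-group candidates representative. This is a minimal sketch; the sample size of 10 per cluster is illustrative, and the cluster column could equally be one of the ICP cluster labels.

# Hypothetical focus-group candidate selection: equal-sized sample per cluster
focus_group = (
    df2.groupby('SVC_Ensemble_Cluster', group_keys=False)
       .sample(n=10, random_state=42)
)
print(focus_group['SVC_Ensemble_Cluster'].value_counts())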

This approach will empower management teams to 1) Validate assumptions while reducing risk, 2) Engage stakeholders for actionable insights, and 3) Create a feedback loop for iterative improvement to ensure upselling strategies are data-driven and effectively tailored to customer behavior.

Use the personas from this study to develop Ideal Customer Profiles (ICPs) to tailor strategies for engagement, retention, and growth. For Low-to-Moderate Prospects/Customers, offer simple, accessible services via phone and online while providing financial literacy programs to empower resource management. This is an incidental approach that keeps the bank engaged in their lives should circumstances change. For Traditional Prospects/Customers, focus on personalized in-branch experiences while encouraging digital adoption with incentives, to reduce in-branch operational costs. This approach is more intentional and fosters loyalty. For Affluent, Tech-Savvy Prospects/Customers, be intentional with this very promising segment: enhance digital platforms to meet high-tech expectations. Premium services (e.g., exclusive rewards or concierge banking) are other examples of intentional engagement with them.

The next step for creating the Ideal Customer Profile is to validate findings with domain experts to ensure strategic alignment and monitor and update profiles based on evolving customer behavior. This will ensure targeted outreach programs.

Customers' perception of support services must be improved by tailoring customer support strategies to the distinct needs of the clusters identified in this study. Drivers of dissatisfaction were segmented by an ensemble clustering analysis. Here are specific actions associated with each:

  1. High Call Dependency / Low Credit
  • Reduce over-reliance on calls by offering self-service, AI-enabled proactive guidance
  • Evaluate the strategic value of fostering incidental engagement with these customers against the opportunity cost of prioritizing intentional engagement with higher-value customers
  • Package low-cost, low-service, low-maintenance, self-service products to reduce the need for servicing and thus improve the perception of poor service quality
  2. Balanced Usage / Moderate Credit
  • Maintain a balance across in-person, online, and call services
  • Intentionally target this segment for service-improvement metrics, create word-of-mouth marketing, and leverage their testimonials to grow the business while combating negative perceptions of poor service quality
  3. Digital-First / High Credit
  • Enhance digital platforms to meet high-credit customers' expectations
  • Offer exclusive digital perks and tools to elevate prestige and self-perception
  • Use ongoing feedback to adjust support strategies and validate improvements in service perception

Cluster-specific needs as identified by this study can be leveraged to address perception of poor service quality and enhance customer experience and loyalty.